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VARIABLE SELECTION USING MM ALGORITHMS 

By David R. Hunter and Runze Li^ 

Pennsylvania State University 

Variable selection is fundamental to high-dimensional statistical 
modeling. Many variable selection techniques may be implemented 
by maximum penalized likelihood using various penalty functions. 
Optimizing the penalized likelihood function is often challenging be- 
cause it may be nondifferentiable and/or nonconcave. This article 
proposes a new class of algorithms for finding a maximizer of the pe- 
nalized likelihood for a broad class of penalty functions. These algo- 
rithms operate by perturbing the penalty function slightly to render 
it differentiable, then optimizing this differentiable function using a 
minorize-maximize (MM) algorithm. MM algorithms are useful ex- 
tensions of the well-known class of EM algorithms, a fact that allows 
us to analyze the local and global convergence of the proposed algo- 
rithm using some of the techniques employed for EM algorithms. In 
particular, we prove that when our MM algorithms converge, they 
must converge to a desirable point; we also discuss conditions under 
which this convergence may be guaranteed. We exploit the Newton- 
Raphson-like aspect of these algorithms to propose a sandwich esti- 
mator for the standard errors of the estimators. Our method performs 
well in numerical tests. 

1. Introduction. Fan and Li [7] discuss a family of variable selection 
methods that adopt a penalized likelihood approach. This family includes 
well-established methods such as AIC and BIC, as well as more recent meth- 
ods such as bridge regression [11], LASSO [23] and SCAD [2, 7]. What all of 
these methods share is the fact that they require the maximization of a pe- 
nalized likelihood function. Even when the log-likelihood itself is relatively 
easy to maximize, the penalized version may present numerical challenges. 
For example, in the case of SCAD or LASSO or bridge regression, the pe- 
nalized log-likelihood function is nondifferentiable; with SCAD or bridge 
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regression, the function is also nonconcave. To perform the maximization, 
Fan and Li [7] propose a new and generic algorithm based on local quadratic 
approximation (LQA). In this article we demonstrate and explore a connec- 
tion between the LQA algorithm and minorization-maximization (MM) al- 
gorithms [14], which shows that many different variable selection techniques 
may be accomplished using the same algorithmic techniques. 

MM algorithms exploit an optimization technique that extends the central 
idea of EM algorithms [6] to situations not necessarily involving missing data 
nor even maximum likelihood estimation. The connection between LQA and 
MM enables us to analyze the convergence of the local quadratic approxi- 
mation algorithm using techniques related to EM algorithms [16, 19, 20, 24]. 
Furthermore, we extend the local quadratic approximation idea here by 
forming a slightly perturbed objective function to maximize. This pertur- 
bation solves two problems at once. First, it renders the objective function 
differentiable, which allows us to prove results regarding the convergence of 
the MM algorithms discussed here. Second, it repairs one of the drawbacks 
that the LQA algorithm shares with forward variable selection: namely, if a 
covariate is deleted at any step in the LQA algorithm, it will necessarily be 
excluded from the final selected model. We discuss how to decide a priori 
how large a perturbation to choose when implementing this method and 
make specific comments about the price paid for using this perturbation. 

The new algorithm we propose retains virtues of the Newton-Raphson 
algorithm, which among other things allows us to compute a standard error 
for the resulting estimator via a sandwich formula. It is also numerically 
stable and is never forced to delete a covariate permanently in the pro- 
cess of iteration. The general convergence results known for MM algorithms 
imply among other things that the newly proposed algorithm converges cor- 
rectly to the maximizer of the perturbed penalized likelihood whenever this 
maximizer is the unique local maximum. The linear rate of convergence of 
the algorithm is governed by the largest eigenvalue of the derivative of the 
algorithm map. 

The rest of the article is organized as follows. Section 2 briefly introduces 
the variable selection problem and the penalized likelihood approach. After 
providing some background on MM algorithms. Section 3 explains their 
connection to the LQA idea, then provides a modification to LQA that may 
be shown to be an MM algorithm for maximizing a perturbed version of 
the penalized likelihood. Various convergence properties of this new MM 
algorithm are also covered in Section 3. Section 4 describes a method of 
estimating covariances and presents numerical tests of the algorithm on a 
set of four diverse problems. Finally, Section 5 discusses the numerical results 
and offers some broad comparisons among the competing methods studied 
in Section 4. Some proofs appear in the Appendix. 
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2. Variable selection via maximum penalized likelihood. Suppose that 
{(xj, yj) : i = 1, . . . , n} is a random sample with conditional log-likelihood 
0) = £j(xf /3, yj, 0) given Xj. Typically, the yt are response variables 
that depend on the predictors Xj through a linear combination x?'/^, and 
is a dispersion parameter. Some of the components of (3 may be zero, which 
means that the corresponding predictors do not influence the response. The 
goal of variable selection is to identify those components of /3 that are zero. 
A secondary goal in this article will be to estimate the nonzero components 
of /3. 

In some variable selection applications, such as standard Poisson or lo- 
gistic regression, no dispersion parameter </> exists. In other applications, 
such as linear regression, cf) is to be estimated separately after /3 is esti- 
mated. Therefore, the penalized likelihood approach does not penalize so 
we simplify notation in the remainder of this article by eliminating explicit 
reference to cj). In particular, ii{(3,c()) will be written ^j(/3). This is standard 
practice in the variable selection literature; see, for example, [7, 11, 21, 23]. 

Many variable selection criteria arise as special cases of the general for- 
mulation discussed in [7], where the penalized likelihood function takes the 
form 

n d d 

(2.1) Q{f3)=Y,im-nY.^mm)^m-nY.^jPjm)- 

1=1 j=i j=i 

In (2.1), the Pj{-) are given nonnegative penalty functions, d is the dimension 
of the covariate vector Xj and the Xj are tuning parameters controlling model 
complexity. The selected model based on the maximized penalized likelihood 
(2.1) satisfies Pj = for certain /3j's, which accordingly are not included in 
this final model, and so model estimation is performed at the same time as 
model selection. Often the Xj may be chosen by a data-driven approach such 
as cross-validation or generalized cross-validation [5]. 

The penalty functions Pj{-) and the tuning parameters Xj are not neces- 
sarily the same for all j. This allows one to incorporate hierarchical prior 
information for the unknown coefficients by using different penalty functions 
and taking different values of Xj for the different regression coefficients. For 
instance, one may not be willing to penalize important factors in practice. 
For ease of presentation, we assume throughout this article that the same 
penalization is applied to every component of (3 and write XjPj{\Pj\) as 
Px{\Pj\), which implies that the penalty function is allowed to depend on A. 
Extensions to situations with different penalty functions for each component 
of f3 do not involve any extra difficulties except more tedious notation. 

Many well-known variable selection criteria are special cases of the pe- 
nalized likelihood of (2.1). For instance, consider the Lq penalty Pa(|/5|) = 
0.5A^ X / 0), also called the entropy penalty in the literature, where /(•) 



4 



D. R. HUNTER AND R. LI 



is an indicator function. Note that the dimension or the size of a model equals 
^jlilPjl 7^ 0), the number of nonzero regression coefficients in the model. 
In other words, the penalized likelihood (2.1) with the entropy penalty can 
be rewritten as 

£{(3)-0.5nX'^\M\, 

where \M\ = J2j 0)i the size of the underlying candidate model. 

Hence, several popular variable selection criteria can be derived from the 
penalized likelihood (2.1) by choosing different values of A. For instance, 
the AIC (or Cp) and BIC criteria correspond to A = \/2/n and \/ (log n)/n, 
respectively, although these criteria were motivated from different principles. 
Similar in its effect to the entropy penalty function is the hard thresholding 
penalty function (see [1]) given by 

P,(|/3|)=A2-(|/3|-A)2l(|/3|<A). 

Recently, many authors have been working on penalized least squares with 
the Lg penalty Pa(|/?|) = •^l/?!'''- Indeed, bridge regression is the solution of 
penalized least squares with the Lq penalty [11]. It is well known that ridge 
regression is the solution of penalized likelihood with the L2 penalty. The 
Li penalty results in LASSO, proposed by Tibshirani [23]. Finally, there is 
the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li [7]. 
For fixed a > 2, the SCAD penalty is the continuous function px{-) defined 
by p;^(0) =0 and, for /? 7^ 0, 

(2.2) p'M) = Xm < A) + i^^^ffit > A), 

a — 1 

where throughout this article p^(|/3|) denotes the derivative oi px{-) evalu- 
ated at \f5\. 

Letting p';!^(|/?|+) denote the limit of p'x{x) as x ^ \f3\ from above, the 
MM algorithms introduced in the next section are shown to apply to any 
continuous penalty function px{P) that is nondecreasing and concave on 
(0,00) such that p'x{0+) < 00. The previously mentioned penalty functions 
that satisfy these criteria are hard thresholding, SCAD, LASSO and Lq 
with < g < 1. Therefore, the methods presented in this article enable a 
wide range of variable selection algorithms. 

Nonetheless, there are some common penalty functions that do not meet 
our criteria. The entropy penalty is excluded because it is discontinuous, 
and in fact maximizing the AIC- or BIC-penalized likelihood in cases other 
than linear regression often requires exhaustive fitting of all possible models. 
For q> 1 the Lq penalty is excluded because it is not concave on (0,oo); 
however, the fact that pa(|/3|) = IPl'' is everywhere differentiable suggests 
that the penalized likelihood function may be susceptible to gradient-based 
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methods, and hence alternatives such as MM may be of hmited value. In 
particular, the special case of ridge regression (q = 2) admits a closed-form 
maximizer, a fact we allude to following (3.17). But there is a more subtle 
reason for excluding Lg penalties with q> 1. Note that p'^^{0+) > for any 
of our nonexcluded penalty functions. As Fan and Li [7] point out, this fact 
(which they call singularity at the origin because it implies a discontinuous 
derivative at zero) ensures that the penalized likelihood has the sparsity 
property: the resulting estimator is automatically a thresholding rule that 
sets small estimated coefficients to zero, thus reducing model complexity. 
Sparsity is an important property for any penalized likelihood technique 
that is to be useful in a variable selection setting. 

3. Maximized penalized likelihood via MM. It is sometimes a challeng- 
ing task to find the maximum penalized likelihood estimate. Fan and Li 
[7] propose a local quadratic approximation for the penalty function: Sup- 
pose that we are given an initial value U (3f^ is very close to 0, then 

set Pj = 0; otherwise, the penalty function is locally approximated by a 
quadratic function using 

^Pj \f3j '\ 

when /jf'^ 7^ 0. In other words. 



(3.1) Px{\Pj\) -p\{\Pj '\) + — (0)^ - 



for Pj ~ Pj^'^ . With the aid of this local quadratic approximation, a Newton- 
Raphson algorithm (for example) can be used to maximize the penalized 
likelihood function, where each iteration updates the local quadratic ap- 
proximation. 

In this section, we show that this local quadratic approximation idea is an 
instance of an MM algorithm. This fact enables us to study the convergence 
properties of the algorithm using techniques applicable to MM algorithms in 
general. Throughout this section we refrain from specifying the form of px(-), 
since the derivations apply equally to any one of hard thresholding, LASSO, 
bridge regression using Lg with < q <1, SCAD or any other method with 
penalty function px{-) satisfying the conditions of Proposition 3.1. 

3.1. Local quadratic approximation as an MM algorithm. MM stands for 
majorize-minimize or minor ize-maximize, depending on the context [14]. 
EM algorithms [6], in which the E-step may be shown to be equivalent to a 
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minorization step, are the most famous examples of MM algorithms, though 
there are many examples of MM algorithms that involve neither maximum 
likelihood nor missing data. Heiser [12] and Lange, Hunter and Yang [17] 
give partial surveys of work in this area. The apparent ambiguity in allowing 
MM to have two different meanings is harmless, since any maximization 
problem may be viewed as a minimization problem by changing the sign of 
the objective function. 

Consider the penalty term —'nJ2jPxi\f^j\) of (2.1), ignoring its minus sign 
for the moment. Mimicking the idea of (3.1), we define the function 

(3.2) $.o(^)=Pa(|^o|) + ^'"^°^^'^^''°'+^ 



2|0o| 

We assume that p\{-) is piecewise differentiable so that p^(|^|+) exists for 
all 9. Thus, $6*0 (^) is a well-defined quadratic function of 6 for all real 6*0 
except for = 0- Section 3.2 remedies the problem that $6»o(") is undefined 
when ^0 = 0. 

We are interested in penalty functions p\{9) for which 

(3.3) ^e^{e) > px{\e\) for aU 6 with equality when 6 = Oq. 

A function <l>6»o(^) satisfying condition (3.3) is said to majorize Pa(|^|) at 
^0- If the direction of the inequality in condition (3.3) were reversed, then 
^Qq{9) would be said to minorize p\{\9\) at ^o- 

The driving force behind an MM algorithm is the fact that condition (3.3) 
implies 

^eo{0)-^eAOo)>Pxm)-Px{\Oo\), 
which in turn gives the descent property 

(3.4) ^eo{0)<^eo{Oo) implies Pa(I^I) < Pa(I^oI). 

In other words, if Oq denotes the current iterate, any decrease in the value 
of (^) guarantees a decrease in the value of px{\6\). If 9k denotes the 
estimate of the parameter at the kth iteration, then an iterative minimiza- 
tion algorithm would exploit the descent property by constructing the ma- 
jorizing function <I>£)^, (0), then minimizing it to give 9^+1 — hence the name 
"majorize-minimize algorithm." Proposition 3.1 gives sufficient conditions 
on the penalty function px{-) in order that ^e^-,{9) majorizes px{\9\). Sev- 
eral different penalty functions that satisfy these conditions are depicted in 
Figure 1 along with their majorizing quadratic functions. 

Proposition 3.1. Suppose that on (0, oo) px{-) is piecewise differen- 
tiable, nondecreasing and concave. Furthermore, px{') is continuous at 
and p'^{0+) < oo. Then for all 9q ^ 0, ^0^-,{9) as defined in (3.2) majorizes 
Pxi\&\) the points ±|0o|- In particular, conditions (3.3) and (3.4) hold. 





(c) (d) 

Fig. 1. Majorizing functions ^eo{0) for various penalty functions are shown as dotted 
curves; the penalty functions are shown as solid curves. The four penalties are (a) hard 
thresholding with X = 2; (b) Li with X = 1; (c) Lo.5 with X = 1; and (d) SCAD with a = 2.1 
and A = 1. In each case 6o — 1. 



Next, suppose that we wish to employ the local quadratic approximation 
idea in an iterative algorithm, where /3^'^^ = {Pi'^ P^t^) denotes the value 
of P at the kth iteration. Appending negative signs to Px{\l3j\) and ^ (k){(3j) 

to turn majorization into minorization, we obtain the following corollary 
from Proposition 3.1 and (2.1). 

Corollary 3.1. Suppose that f3j^^ ^ for all j and that p\{0) satisfies 
the conditions given in Proposition 3.1. Then 

d 

(3.5) Skm^m-nY.'^^i.){f3j) 

i=i ' 

mmorizes 
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By the ascent property — the analogue for minorizing functions of the de- 
scent property (3.4) — Corollary 3.1 suggests that given (3^^\ we should de- 
fine /3(*^+^) to be the maximizer of Sk{(3), thereby ensuring that 
Q(/3 ('')). The benefit of replacing one maximization problem by another in 
this way is that Sk{(3) is susceptible to a gradient-based scheme such as 
Newton-Raphson, unlike the nondifferentiable function Q{P). Since the sum 
in (3.5) is a quadratic function of /3 — in fact, the Hessian matrix of this sum 
is a diagonal matrix — this sum presents no difficulties for maximization. 
Therefore, the difficulty of maximizing Sk{P) is determined solely by the 
form of For example, in the special case of a linear regression model 

with normally distributed errors, the log-likelihood function ^(/3) is itself 
quadratic, which implies that Sk{(3) may be maximized analytically. 

If some of the components of (3^^^ equal zero (or in practice, if some of 
them are close to zero), the algorithm proceeds by simply setting the final 
estimates of those components to be zero, deleting them from consideration, 
then defining the function Sk{(3) as in (3.5), where is the vector composed 
of the nonzero components of (3. The weakness of this scheme is that once 
a component is set to zero, it may never reenter the model at a later stage 
of the algorithm. The modification proposed in Section 3.2 eliminates this 
weakness. 

3.2. An improved version of local quadratic approximation. The draw- 
back of ^g^^{9) in (3.2) is that when = 0, the denominator 2\9q\ makes 
$^(,(0) undefined. We therefore replace 2\9o\ by 2(e + j^ol) for some e > 0. 
The resulting perturbed version of $0g(0), which is defined for all real is 
no longer a majorizer oi p\{6) as required by the MM theory. Nonetheless, 
we may show that it majorizes a perturbed version of px{0), which may 
therefore be used to define a new objective function Qe{(3) that is similar to 
Q{(3). To this end, we define 

(3.6) p,,e{\e\)=pm)-e r^dt 

Jo £ + t 

and 

d 

(3.7) Qs{P)=e{P)-nY,pxAm)- 

i=i 

The next proposition shows that an MM algorithm may be applied to the 
maximization of Qei(3) and suggests that a maximizer of Qeif3) should be 
close to a maximizer of Q{f3) as long as e is small and Q{P) is not too flat 
in the neighborhood of the maximizer. 
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Proposition 3.2. Suppose that px{-) satisfies the conditions of Propo- 
sition 3.1. For £ > define 



2ie- 



7^ = nPxilP) 1)sgn(/3 ') 



(3.8) '^>e,A(^)=pxAM) + 
Then: 

(a) For any fixed e > 0, 

d 

(3.9) Sk,e{f3) ^ m -nY.^,iu, ^(/3,) 

i=i ' 

minorizes Qe{P) at (3^^'^ . 

(b) As e [d, \Qeifa) — Q{I3)\ uniformly on compact subsets of the 
parameter space. 

Note that when the MM algorithm converges and Sk^lS) is maximized 
by I5^^\ it fohows by straightforward differentiation that 

a/3, - 9/3, — 'e+^C'^- 

Thus, when e is small, the resulting estimator (3 approximately satisfies the 
penalized likelihood equation 

Suppose that (3^ denotes a maximizer of Qeif3) and denotes a maxi- 
mizer of Q{P). In general, it is impossible to bound — /3q|| as a function 
of e because Q(/3) may be quite flat near its maximum. However, suppose 
that QeiP) is upper compact, which means that {/3 : QeiP) > c} is a compact 
subset of R'^ for any real constant c. In this case we may obtain the following 
corollary of Proposition 3.2(b). 

Corollary 3.2. Suppose that f3^ denotes a maximizer of Qe{P)- If 
QeiP) is upper compact for all e > 0, then under the conditions of Proposi- 
tion 3.1 any limit point of the sequence {/3e}ej,o is a maximizer ofQ{(3). 

Both Proposition 3.2(a) and Corollary 3.2 give results as e j 0, which 
suggests the use of an algorithm in which e is allowed to go to zero as 
the iterations progress. Certainly it would be possible to implement such 
an algorithm. However, in this article we interpret these results merely as 
theoretical justification for using the e perturbation in the first place, and 
instead we hold e fixed throughout the algorithms we discuss. The choice of 
this fixed e is the subject of Section 3.3. 
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3.3. Choice of e. Essentially, we want to solve the penalized likelihood 
equation (3.10) for 13 j ^ [recall that p\{Pj) is not difFerentiable at Pj = 0]. 
Suppose, therefore, that convergence is declared in a numerical algorithm 
whenever \dQ{(3)/d(3j\ < r for a predetermined small tolerance r. Our algo- 
rithm accomplishes this by declaring convergence whenever |9Qe(/3)/9/3j| < 
r/2, where £ satisfies 



(3.11 



T 

<2- 



Since p'\{0) is nonincreasing on (0,cx)), (3.6) implies 



dPj dl3j 

for f3j / 0. Thus, to ensure that (3.11) is satisfied, simply take 

This may of course lead to a different value of e each time (3 changes; there- 
fore, in our implementations we fix 

When the algorithm converges, if \dQ((3) / dj3j\ > r, j3j is presumed to be 
zero. In the numerical examples of Section 4 we take r = 10~^. 

3.4. The algorithm. By the ascent property of an MM algorithm, (3^^'^^'^ 
should be defined at the /cth iteration so that 

(3.13) Sfc,,(/3(^^+i))>5,.,,(/3«). 

Note that if (3^^^^^ satisfies (3.13) without actually maximizing Sk,e{f3), 
we still refer to the algorithm as an MM algorithm, even though the second 
M — for "maximize" — is not quite accurate. Alternatively, we could adopt the 
convention used for EM algorithms by Dempster, Laird and Rubin [6] and 
refer to such algorithms as generalized MM, or GMM, algorithms; however, 
in this article we prefer to require extra duty of the label MM and avoid 
further crowding the field of acronym-named algorithms. 

From (3.9) we see that Sk,e{P) consists of two parts, and the sum 
of quadratic functions —^t ik) {13 j) of the components of (3. The latter part 

is easy to maximize directly; thus, the difficulty of maximizing Sk^sif^)^ 
at least attaining (3.13), is solely determined by the form of i{f3). In gen- 
eral, when is easy to maximize, then so is S^^eiP), which distinguishes 



VARIABLE SELECTION VIA MM 



11 



Sk^eiP) from the (e-perturbed) penalized likelihood Qs{P)- Even if £{(3) is 
not easily maximized, at least if it is differentiable, then so is Sk,eiP)j which 
means (3.13) may be attained using standard gradient-based maximization 
methods such as Newton-Raphson. The function Q^{f3), though differen- 
tiable, is not easily optimized using gradient-based methods because it is 
very close to the nondifferentiable function Q{(3). 

Although it is impossible to detail all possible forms of likelihood func- 
tions i{(3) to which these methods apply, we begin with the completely 
general Newton-Raphson-based algorithm 

(3.14) /3(^+i) = /3('=) - afc[v25fc,,(/3('=))]-V5fc,,(/3('=)), 

where V^S'fc^e(-) and VSk^ei') denote the Hessian matrix and gradient vector, 
respectively, and is some positive scalar. Using the definition (3.9) of 
Sk,e{P), algorithm (3.14) becomes 

(3.15) =/3(*^) -afc[V2^(/3('=)) -nEfc]-^[V£(/3('=)) -nEfc/3^'=)], 

where E^. is the diagonal matrix with (j,j)th entry p'xi\Pj''^\+)/{^ + 

We take the ordinary maximum likelihood estimate to be the initial value 

(3^^^ in the numerical examples of Section 4. 

There are some important special cases. First, we consider perhaps the 
simplest case but by far the most important case because of its ubiquity — the 
linear regression model with normal homoscedastic errors, for which 

(3.16) V^(/3) = X^y - 

where X = (xi, . . . ,x„,)^, the design matrix of the regression model, and y 
is the response vector consisting of y^. As pointed out at the beginning of 
Section 2, we omit mention of the error variance parameter here because 
this parameter is to be estimated using standard methods once (3 has been 
estimated. In this case, (3.15) with = 1 gives a closed-form maximizer 
of Sk,e{(^) because Sk^eiP) is exactly a quadratic function. The resulting 
algorithm 

(3.17) /3('=+^) = {X^X + nEk}~Vy 

may be viewed as iterative ridge regression. In the case of LASSO, which 
uses the Li penalty, this algorithm is guaranteed to converge to the unique 
maximizer of QdP) (see Corollary 3.3). 

The slightly more general case of generalized linear models with canonical 
link includes common procedures such as logistic regression and Poisson 
regression. In these cases, the Hessian matrix is 

V^£{f3) = -X^yX, 

where V is a diagonal matrix whose (i, i)th entry is given by (3) and v{-) 
is the variance function. Therefore, the Hessian matrix is negative definite 
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provided that X is of fuh rank. This means that the vector -[V^Sk,s{f3^''^)]~^VS 
in (3.14) is a direction of ascent [unless of course VSfc,,(/3('^)) = 0], which 
guarantees the existence of a positive Ok such that (3^^^^^ satisfies (3.13). 
A simple way of determining at is the practice of step-halving: try Uk = 
for = 0, 1, 2, . . . until the resulting (3^^^^^ satisfies (3.13). For large samples 
in practice, £(/3) tends to be nearly quadratic, particularly in the vicin- 
ity of the MLE (which is close to the penalized MLE for large samples), 
so step-halving does not need to be employed very often. Nonetheless, in 
our experience it is always wise to check that (3.13) is satisfied; even when 
the truth of this inequality is guaranteed as in the linear regression model, 
checking the inequality is often a useful debugging tool. Indeed, whenever 
possible it is a good idea to check that the objective function itself satisfies 
(5e(/9^'^^^^) > Qe(/3'-'^'*); for even though this inequality is guaranteed theo- 
retically by (3.13), in practice many a programming error is caught using 
this simple check. 

In still more general cases, the Hessian matrix V^^(/3) may not be negative 
definite or it may be difficult to compute. If the Fisher information matrix 
/(/3) is known, then —nl{(3) may be used in place of V^£(/3). This leads to 

and the positive definiteness of the Fisher information will ensure that step- 
halving will always lead to an increase in Sk^^{P). 

Finally, we mention the possibility of applying an MM algorithm to a 
penalized partial likelihood function. Consider the example of the Cox pro- 
portional hazards model [4], which is the most popular model in survival 
data analysis. The variable selection methods of Section 2 are extended to 
the Cox model by Fan and Li [8]. Let Tj,Cj and Xj be, respectively, the 
survival time, the censoring time and the vector of covariates for the ith 
individual. Correspondingly, let Z, = min{Tj, Q} be the observed time and 
let 6i = I{Ti < Ci) be the censoring indicator. It is assumed that Tj and Ci 
are conditionally independent given Xj and that the censoring mechanism 
is noninformative. Under the proportional hazards model, the conditional 
hazard function /i(tj|xj) of Tj given Xj is given by 

/i(ti|xj) = /io(tj) exp(xf /3), 

where /io(^) is the baseline hazard function. This is a semiparametric model 
with parameters /io(i) and /3. Denote the ordered uncensored failure times 
by ti< ■•■ <t^, and let (j) provide the label for the item falling at t^, 
so that the covariates associated with the failures are X(i), . . . ,X(^). Let 
Rj = {i:Zi> t^} denote the risk set right before the time t^. A partial 
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likelihood is given by 



N 




exp(xf/3) 



Fan and Li [8] consider variable selection via maximization of the penalized 
partial likelihood 

d 

i=i 

It can be shown that the Hessian matrix of ^p(/3) is negative definite pro- 
vided that X is of full rank. Under certain regularity conditions, it can 
further be shown that in the neighborhood of a maximizer, the partial like- 
lihood is nearly quadratic for large n. 



3.5. Convergence. It is not possible to prove that a generic MM algo- 
rithm converges at all, and when an MM algorithm does converge, there is 
no guarantee that it converges to a global maximum. For example, there are 
well-known pathological examples in which EM algorithms — or generalized 
EM algorithms, as discussed following inequality (3.13) — converge to saddle 
points or fail to converge [18]. Nonetheless, it is often possible to obtain 
convergence results in specific cases. 

We define a stationary point of the function Qeif3) to be any point (3 
at which the gradient vector is zero. Because the differentiable function 
Sk,e{P) is tangent to Qe{P) at the point /3*^'^-* by the minorization property, 
the gradient vectors of Sk,eiP) ™d QeiP) are equal when evaluated at (3^^\ 
Thus, when using the method of Section 3.4 to maximize 5fc,e(/3), we see 
that fixed points of the algorithm (i.e., points with gradient zero) coincide 
with stationary points of Qe(/3). Letting M(/3) denote the map implicitly 
defined by the algorithm that takes P^'^^ to (3^^^^^ for any point /3'-^\ in- 
equality (3.13) states that Sk^ei^iP)} > S^^eiP)- The limit points of the 
set {/3^^^ : A; = 0, 1, 2, . . .} are characterized by the following slightly modified 
version of Lyapunov's theorem [16]. 

Proposition 3.3. Given an initial value let P^^^ =M^{P^^^). If 

QeiP) = Qei^iP)} only for stationary points P of Qs and if P* is a limit 
point of the sequence {P^^^} such that M{P) is continuous at P* , then P* 
is a stationary point of Qe{P)- 



Equation (3.14) with ak = 1 gives 

(3.18) M(/3(^^) = /3(^) - {V'^Sk,e{P^^^)]~^^Qe{P^^^), 
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where (3.18) uses the fact that VSk^ei/S''''^) = VQe{P^^^)- As discussed in [16, 
17], the derivative of M{j3) gives insight into the local convergence properties 
of the algorithm. Suppose that VQsiP^^^) = 0, so /3^^'^ is a stationary point. 
In this case, differentiating (3.18) gives 

VM(/3('=)) = {V25fc,,(/3('^))}-^{V25fc,,(/3W) - V2q,(/3('=))}. 

It is possible to write V"^ Sk,e{p'^''^) - V^Qe{(3^^^) in closed for m as nAu, 
where = diag{a(/?|'^'*), . . . , a{p'^J'^)} and 

Under the conditions of Proposition 3.2, p-((|t|+) < and thus nA^ is neg- 
ative semidefinite, a fact that may also be interpreted as a consequence 
of the minorization of Qe{(^) by S^^eiP)- Furthermore, S/'^£(j3^^^) is of- 
ten negative definite, as pointed out in Section 3.4, which implies that 
V'^Sk^eiP^'^^) is negative definite. This fact, together with the fact that 
V^5'fc,e(/3'''^^) — V'^Qs{(3^''^) is negative semidefinite, implies that the eigen- 
values of VM(/3(*^)) are all contained in the interval [0,1) [13]. Ostrowski's 
theorem [22] thus implies that the MM algorithm defined by (3.18) is locally 
attracted to P^''^ and that the rate of convergence to (3^''^ in a neighborhood 
of (3^^^ is linear with rate equal to the largest eigenvalue of M{(5^^'^). In 
other words, if /3* is a stationary point and p < 1 is the largest eigenvalue 
of VM(/3*), then for any 5 > such that 0<p — 5</3 + 5<l, there exists 
a neighborhood containing /3* such that for all (3 £ Ng, 

{p-6)\\P-P*\\<\\M{(3)-(3*\\<{p + 6)\\l3-l3*\\. 

Further details about the rate of convergence for similar algorithms may be 
found in [16, 17]. 

Lyapunov's theorem (Proposition 3.3) gives a necessary condition for a 
point to be a limit point of a particular MM algorithm. To conclude this 
section, we consider a sufficient condition for the existence of a limit point. 
Suppose that the function Qe{f3) is upper compact, as defined in Section 3.2. 
Then given the initial parameter vector j3^^\ the set B = {(3 £ : Qe{f3) > 
(5(;3(°))} is compact; furthermore, by (3.6) and (3.7), Qeil3) > Q{(3) so that 
B contains the entire sequence {/3^*^^}^o- This guarantees that the sequence 
has at least one limit point, which must therefore be a stationary point of 
Qeif3) by Proposition 3.3. If in addition there is no more than one stationary 
point — for example, if Qei(3) is strictly concave — then we may conclude that 
the algorithm must converge to the unique stationary point. 

Upper compactness of Qe{f3) follows as long as Qe{P) —oo whenever 
1 1/3 1 1 ^ oo; this is often not difficult to prove for specific examples. In the 
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particular case of the Li penalty (LASSO), strict concavity also holds as 
long as i{l3) is strictly concave, which implies the following corollary. 

Corollary 3.3. Ifpx{\6\) = X\6\ andi{(3) is strictly concave and upper 
compact, then the MM algorithm of (3.15) gives a sequence {/fl^*"^} converg- 
ing to the unique maximizer of Qs{(3) for any e > 0. 

In particular. Corollary 3.3 implies that using our algorithm with the 
e-perturbed LASSO penalty guarantees convergence to the maximum pe- 
nalized likelihood estimator for any full-rank generalized linear model or 
(say) Cox proportional hazards model. However, strict concavity of Qe{P) 
is not typical for other penalty functions presented in this article in light of 
the requirement in Proposition 3.2 that px{-) be concave — and hence that 
—px{-) be convex — on (0,oo). This fact means that when an MM algorithm 
using some penalty function other than Li converges, then it may converge 
to a local, rather than a global, maximizer of Qs{P)- This can actually be 
an advantage, since one might like to know if the penalized likelihood has 
multiple local maxima. 

4. Numerical examples. Since Fan and Li [7] have already compared the 
performance of LASSO with SCAD using a local quadratic approximation 
and other existing methods, in the following four numerical examples we 
focus on assessing the performance of the proposed algorithms using the 
SCAD penalty (2.2). Namely, we compare the unmodified LQA to our mod- 
ified version (both using SCAD), where e is chosen according to (3.12) with 
r = 10~^. For SCAD, p^(0+) = A, and this tuning parameter A is chosen 
using generalized cross-validation, or GCV [5] . As suggested by Fan and Li 
[7], we take a = 3.7 in the definition of SCAD. 

The Newton-Raphson algorithm (3.15) enables a standard error estimate 
via a sandwich formula: 

(4.1) c5^(^) = {V^m - nEkr^cmiVSkAPm^^m - nE^r^ 
where 

1 " 

c5v{V5fc,,(/3)} = - 5^[V^i(/3) - nEkP][Vim - nE.pf 
1=1 



1 " 



J2Vi,i(3)-nEkf3 



1=1 



1 

Y,Vii{fi)-nEk(3 



1 " 



1=1 
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Naturally, another estimate may be formed if —nl{(3) is substituted for 
V^(/3) in (4.1). Fan and Peng [10] establish the consistency of this sandwich 
formula for related problems, and their method of proof may be adapted to 
this situation, though we do not do so in this article. 

For the simulated examples, Examples 1-3, we compare the performance 
of the proposed procedures along with AIC and BIC in terms of model error 
and model complexity. With /x(x) =£'(y|x), model error (ME) is defined 
as E{(l{-x) — ^(x)}^, where the expectation is taken with respect to a new 
observation x. The MEs of the underlying procedures are divided by that 
of the ordinary maximum likelihood estimate, so we report relative model 
error (RME). 

Example 1 {Linear regression). In this example we generated 500 data 
sets, each of which consists of 100 observations from the model 

y = x^/3 + e, 

where (3 \s a. twelve-dimensional vector whose first, fifth and ninth compo- 
nents are 3, 1.5 and 2, respectively, and whose other components equal 0. 
The components of x and e are standard normal and the correlation be- 
tween Xi and Xj is taken to be p. In our simulation we consider three cases: 
yO = 0.1, p = 0.5 and p = 0.9. In this case there is a closed form for the model 
error, namely ME(^) = {0 — /3)'^cov(x)(/3 — (3). The median of the relative 
model error (RME) over 500 simulated data sets is summarized in Table 1. 
The average number of coefficients is also reported in Table 1, in which 
the column labeled "C" gives the average number of coefficients, of the nine 
true zeros, correctly set to zero and the column labeled "I" gives the average 
number of the three true nonzeros incorrectly set to zero. 

In Table 1 New and LQA refer to the newly proposed algorithm and 
the local quadratic approximation algorithm of Fan and Li [7]. AIC and 



Table 1 

Relative model errors for linear regression 



Method 


RME 


Zeros 




RME 


Zeros 




RME 


Zeros 




Median 


C 


I 


Median 


C 


I 


Median 


C 


I 


P 


= 0.9 




P 


= 0.5 




P = 


0.1 




New 


0.437 


8.346 





0.215 


8.708 





0.238 


8.292 





LQA 


0.590 


7.772 





0.237 


8.680 





0.269 


8.272 





BIC 


0.337 


8.644 





0.335 


8.652 





0.328 


8.656 





AIC 


0.672 


7.358 





0.673 


7.324 





0.668 


7.374 





Oracle 


0.201 


9.000 





0.202 


9 





0.211 


9 
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Table 2 



Standard deviations and standard errors of fi\ in the linear regression model 



SD 



SE (std(SE)) 



SD 



SE (std(SE)) 



SD 



SE (std(SE)) 



Method 



p = 0.9 



p = 0.5 



p = 0.1 



LSE 


0.339 


0.322 (0.035) 


0.152 


0.144 (0.016) 


0.114 


0.109 (0.012) 


New 


0.303 


0.260 (0.036) 


0.129 


0.120 (0.017) 


0.104 


0.098 (0.014) 


LQA 


0.315 


0.265 (0.037) 


0.128 


0.120 (0.017) 


0.105 


0.098 (0.014) 


BIG 


0.295 


0.269 (0.028) 


0.133 


0.124 (0.013) 


0.105 


0.101 (0.010) 


AIC 


0.322 


0.278 (0.029) 


0.145 


0.128 (0.013) 


0.109 


0.101 (0.010) 


Oracle 


0.270 


0.264 (0.027) 


0.126 


0.124 (0.013) 


0.103 


0.102 (0.010) 



BIC stand for the best subset variable selection procedures that minimize 
AIC scores and BIC scores. Finally, "Oracle" stands for the oracle estimate 
computed by using the true model y = f3ixi + + /Jgxg + e. When the 
correlation among the covariates is small or moderate, we see that the new 
algorithm performs the best in terms of model error and LQA also performs 
very well; their RMEs are both very close to those of the oracle estimator. 
When the covariates are highly correlated, the new algorithm outperforms 
LQA in terms of both model error and model complexity. The performance 
of BIC and AIC remains almost the same for the three cases in this example; 
Table 1 indicates that BIC performs better than AIC. 

We now test the accuracy of the proposed standard error formula. The 
standard deviation of the estimated coefficients for the 500 simulated data 
sets, denoted by SD, can be regarded as the true standard deviation except 
for Monte Carlo error. The average of the estimated standard errors for 
the 500 simulated data sets, denoted by SE, and their standard deviation, 
denoted by std(SE), gauge the overall performance of the standard error 
formula. Table 2 only presents the SD, SE and std(SE) of /3i. The results 
for other coefficients are similar. In Table 2, LSE stands for the ordinary 
least squares estimate; other notation is the same as that in Table 1. The 
differences between SD and SE are less than twice std(SE), which suggests 
that the proposed standard error formula works fairly well. However, the 
SE appears to consistently underestimate the SD, a common phenomenon 
(see [15]), so it may benefit from some slight modification. 

Example 2 (Logistic regression). In this example we assess the perfor- 
mance of the proposed algorithm for a logistic regression model. We gen- 
erated 500 data sets, each of which consists of 200 observations, from the 
logistic regression model 



(4.2) 



Mx)^{P(y = l|x)} 



exp(x /3) 



1 + exp(x^/3) 
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where /3 is a nine-dimensional vector whose first, fourth and seventh com- 
ponents are 3, 1.5 and 2 respectively, and whose other components equal 0. 
The components of x are standard normal, where the correlation between Xi 
and Xj is p. In our simulation we consider two cases, p = 0.25 and p = 0.75. 
Unlike the model error for linear regression models, there is no closed form 
of ME for the logistic regression model in this example. The ME, summa- 
rized in Table 3, is estimated by 50,000 Monte Carlo simulations. Notation 
in Table 3 is the same as that in Table 1. It can be seen from Table 3 that 
the newly proposed algorithm performs better than LQA in terms of model 
error and model complexity. We further test the accuracy of the standard 
error formula derived by using the sandwich formula (4.1). The results are 
similar to those in Table 2 — the proposed standard error formula works fairly 
well — so they are omitted here. 

Best variable subset selection using the BIC criterion is seen in Table 3 
to perform quite well relative to other methods. However, best subset selec- 
tion in this example requires an exhaustive search over all possible subsets, 



Table 3 

Relative model errors for logistic regression 



Method 


RME 


Zeros 




RME 


Zeros 


Median 


C 


I 


Median 


C 


I 


P = 


0.25 




P 


= 0.75 




New 


0.277 


5.922 





0.528 


5.534 


0.222 


LQA 


0.368 


5.728 





0.644 


4.970 


0.090 


BIC 


0.304 


5.860 





0.399 


5.796 


0.304 


AIC 


0.673 


4.930 





0.683 


4.860 


0.092 


Oracle 


0.241 


6 





0.216 


6 






Table 4 

Computing time for the logistic model (seconds per simulation) 



p 


d 


New 


LQA 


BIC 


AIC 


0.25 


8 


0.287 


0.142 


2.701 


2.699 




9 


0.316 


0.151 


5.702 


5.694 




10 


0.348 


0.180 


11.761 


11.754 




11 


0.424 


0.199 


26.702 


26.576 


0.75 


8 


0.395 


0.199 


2.171 


2.166 




9 


0.438 


0.205 


4.554 


4.553 




10 


0.452 


0.225 


9.499 


9.518 




11 


0.532 


0.244 


19.915 


19.959 
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and therefore it is computationally expensive. The methods we propose can 
dramatically reduce computational cost. To demonstrate this point, ran- 
dom samples of size 200 were generated from model (4.2) with (3 being 
a d-dimensional vector whose first, fourth and seventh components are 3, 
1.5 and 2, respectively, and whose other components equal 0. Table 4 de- 
picts the average computing time for each simulation with (i = 8,...,ll, 
and indicates that computing times for the BIC and AIC criteria increase 
exponentially with the dimension (i, making these methods impractical for 
parameter sets much larger than those tested here. Given the increasing 
importance of variable selection problems in fields like genetics and data 
mining, where the number of variables is measured in the hundreds or even 
thousands, efficiency of algorithms is an important consideration. 

Example 3 {Cox model). We investigate the performance of the pro- 
posed algorithm for the Cox proportional hazard model in this example. We 
simulated 500 data sets each for sample sizes n = 40, 50 and 60 from the 
exponential hazard model 

(4.3) /i(t|x) = exp(x^/3), 

where (3 = (0.8,0,0,1,0,0,0.6,0)"^. This model is used in the simulation 
study of Fan and Li [8]. The x„'s were marginally standard normal and 
the correlation between Xu and x^ was with p = 0.5. The distribution 

of the censoring time is an exponential distribution with mean [/exp(x-^/3Q), 
where U is randomly generated from the uniform distribution over [1,3] for 
each simulated data set, so that 30% of the data are censored. Here Pq = (3 
is regarded as a known constant so that the censoring scheme is noninforma- 
tive. The model error S{/i(x) — /i(x)}^ is estimated by 50,000 Monte Carlo 
simulations and is summarized in Table 5. The performance of the newly 
proposed algorithm is similar to that of LQA. Both the new algorithm and 
LQA perform better than best subset variable selection with the AIC or BIC 
criteria. Note that the BIC criterion is a consistent variable selection crite- 
rion. Therefore, as the sample size increases, its performance becomes closer 
to that of the nonconcave penalized partial likelihood procedures. We also 
test the accuracy of the standard error formula derived by using the sand- 
wich formula (4.1). The results are similar to those in Table 2; the proposed 
standard error formula works fairly well. 

As in Example 2, we take /3 to be a d-dimensional vector whose first, 
fourth and seventh components are nonzero (0.8, 1.0, and 0.6, resp.) and 
whose other components equal 0. Table 6 shows that the proposed algorithm 
and LQA can dramatically save computing time compared with AIC and 
BIC. 

As a referee pointed out, it is of interest to investigate the performance of 
variable selection algorithms when the "full" model is misspecified. Model 
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Table 5 

Relative model errors for the Cox model 



Method 


RME 


Zeros 


RME 


Zeros 


RME 


Zeros 


Median 


C 


I 


Median 


C 


I 


Median 


C 


I 




n = 40 






n = 50 




n 


= 60 




New 


0.173 


4.790 


1.396 


0.288 


4.818 


1.144 


0.324 


4.882 


0.904 


LQA 


0.174 


4.260 


0.626 


0.296 


4.288 


0.440 


0.303 


4.332 


0.260 


BIG 


0.247 


4.492 


0.606 


0.337 


4.564 


0.442 


0.344 


4.624 


0.272 


AIC 


0.470 


3.880 


0.358 


0.551 


3.948 


0.240 


0.577 


3.986 


0.160 


Oracle 


0.103 


5 





0.152 


5 





0.215 


5 






misspecification is a concern for all variable selection procedures, not merely 
those discussed in this article. To address this issue, we generated a random 
sample from model (4.3) with /3 = (0.8, 0, 0, 1, 0, 0, 0.6, 0, /Jg, /3io)^, where 
/3g = /3io = 0.2 or 0.4. The first eight components of x are the same as those 
in the above simulation. We take xg = (xf — 1)/ \/2 and xiq = (x^ — 1)/ \/2- In 
our model fitting, our "full" model only uses the first eight components of x. 
Thus, we misspecified the full model by ignoring the last two components. 
Based on the "full" model, variable selection procedures are carried out. 
The oracle procedure uses (xi, X4, X7, xg, xio)"^ to fit the data. The model 
error E{fi{x.) — ^(x)}^, where /x(x) is the mean function of the true model 
including all ten components of x, is estimated by 50,000 Monte Carlo simu- 
lations and is summarized in Table 7, from which we can see that all variable 
selection procedures outperform the full model. This implies that selecting 



Table 6 

Computing time for the Cox model (seconds per simulation) 



n 


d 


New 


LQA 


BIC 


AIC 


40 


8 


0.248 


0.147 


0.415 


0.418 




9 


0.258 


0.140 


0.843 


0.843 




10 


0.299 


0.149 


1.711 


1.712 




11 


0.327 


0.162 


3.680 


3.675 


50 


8 


0.320 


0.197 


0.588 


0.591 




9 


0.364 


0.200 


1.218 


1.225 




10 


0.406 


0.217 


2.532 


2.531 




11 


0.466 


0.211 


5.189 


5.171 


60 


8 


0.417 


0.263 


0.820 


0.827 




9 


0.474 


0.268 


1.795 


1.768 




10 


0.513 


0.279 


3.454 


3.454 




11 


0.574 


0.288 


7.219 


7.162 
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significant variables can dramatically reduce both model error and model 
complexity. Prom Table 7 we can see that both the newly proposed algo- 
rithm and LQA significantly reduce the model error of best subset variable 
selection using AIC or BIC. 

Example 4 {Environmental data). In this example, we illustrate the 
proposed algorithm in the context of analysis of an environmental data set. 
This data set consists of the number of daily hospital admissions for circu- 
lation and respiration problems and daily measurements of air pollutants. 
It was collected in Hong Kong from January 1, 1994 to December 31, 1995 
(courtesy of T. S. Lau). Of interest is the association between levels of pol- 
lutants and the total number of daily hospital admissions for circulatory 
and respiratory problems. The response is the number of admissions and 
the covariates Xi to X3 are the levels (in fig/m"^) of the pollutants sulfur 
dioxide, nitrogen dioxide and dust. Because the response is count data, it 
is reasonable to use a Poisson regression model with mean /u(x) to analyze 
this data set. To reduce modeling bias, we include all linear, quadratic and 
interaction terms among the three air pollutants in our full model. Since 
empirical studies show that there may be a trend over time, we allow an 
intercept depending on time, the date on which observations were collected. 
In other words, we consider the model 

log{/i(x)} = /3o{t) + PiX, + 132X2 + PsXs + PiXf + /35XI 
+ (3^X1 + pjXiX2 + PsXiX^ + f3gX2X^. 



Table 7 

Relative model errors for misspecified Cox models 



(/39,/3io) 


Method 


RME 


# of zeros 


RME 


# of zeros 


RME 


# of zeros 


n 


= 40 


n 


= 50 


n 


= 60 


(0.2, 0.2) 


New 


0.155 


6.276 


0.259 


6.142 


0.298 


5.906 




LQA 


0.146 


4.992 


0.251 


4.822 


0.294 


4.662 




BIC 


0.244 


5.186 


0.356 


5.020 


0.487 


4.932 




AIC 


0.441 


4.336 


0.564 


4.162 


0.657 


4.104 




Oracle 


0.254 


5 


0.194 


5 


0.239 


5 


(0.4, 0.4) 


New 


0.205 


6.544 


0.323 


6.384 


0.479 


6.132 




LQA 


0.197 


5.166 


0.324 


4.924 


0.469 


4.806 




BIC 


0.345 


5.352 


0.477 


5.156 


0.593 


5.110 




AIC 


0.560 


4.444 


0.670 


4.236 


0.755 


4.240 




Oracle 


0.268 


5 


0.228 


5 


0.273 


5 
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We further parameterize the intercept function by a cubic spline 

5 

Po{t) = Poo + Poit + P02t^ + P03t^ + /50{i+4) {t - kj)l, 

i=i 

where the knots kj are chosen to be the 10th, 25th, 50th, 75th and 90th 
percentiles of t. Thus, we are dealing with a Poisson regression model with 
18 variables. To avoid numerical instability, the time variable and the air 
pollutant variables are standardized. 

Generalized cross-validation is used to select the tuning parameter A for 
the new algorithm using SCAD. The plot of the GCV scores against A is 
depicted in Figure 2(a), and the selected A equals 0.1933. In Table 8, we 
see that all linear terms are very significant, whereas the quadratic terms of 
SO2 and dust and the 5*02 x dust interaction are not significant. The plot of 
the estimated intercept function /?o(i) is depicted in Figure 2(b) along with 
the estimated intercept function under the full model. The two estimated 
intercept functions are almost identical and capture the time trend very well, 
but the new algorithm saves two degrees of freedom by deleting the and 
{t — k^)^ terms from the intercept function. 

For the purpose of comparison, SCAD using the LQA is also applied to 
this data set. The plot of the GCV scores against A is also depicted in Figure 
2(a), and the selected A again equals 0.1933. In this case, not only the and 




Fig. 2. In (a) the solid line indicates the GCV scores for SCAD using the new algorithm, 
and the dashed line indicates the same thing for the LQA algorithm. In (b) the solid and 
thick dashed lines that nearly coincide indicate the estimated intercept functions for the 
new algorithm and the full model, respectively; the dash-dotted line is for LQA. The dots 
are the full-model residuals log(y) — x"^/3j^^le- 



VARIABLE SELECTION VIA MM 



23 



Table 8 

Estimated coefficients and their standard errors 



Covariate 


MLE 


New 


LQA 


SO2 


0.0082 (0.0041) 


0.0043 (0.0024) 


0.0090 (0.0029) 


NO2 


0.0238 (0.0051) 


0.0260 (0.0037) 


0.0311 (0.0033) 


Dust 


0.0195 (0.0054) 


0.0173 (0.0037) 


0.0043 (0.0026) 


sol 


-0.0029 (0.0013) 


(0.0009) 


-0.0025 (0.0010) 


NOl 


0.0204 (0.0043) 


0.0118 (0.0029) 


0.0157 (0.0031) 


Dust^ 


0.0042 (0.0025) 


(0.0015) 


0.0060 (0.0018) 


SO2 X NO2 


-0.0120 (0.0039) 


-0.0050 (0.0021) 


-0.0074 (0.0022) 


SO 2 X dust 


0.0086 (0.0047) 


(< 0.00005) 


(NA) 


NO2 X dust 


-0.0305 (0.0061) 


-0.0176 (0.0037) 


-0.0262 (0.0041) 



NOTE: NA stands for "Not available.^ 



{t — k^)^ terms but also the t and terms are deleted from the intercept 
function. The estimated intercept function is the dash-dotted curve in Figure 
2(b), and now the resulting estimated intercept function looks dramatically 
different from the one estimated under the full model, and furthermore it 
appears to do a poor job of capturing the overall trend. The LQA estimates 
shown in Table 8 are quite different from those obtained using the new 
algorithm, even though they both use the same tuning parameter. Recall 
that LQA suffers the drawback that once a parameter is deleted, it cannot 
reenter the model, which appears to have had a major impact on the LQA 
model estimates in this case. Standard errors in Table 8 are available for 
the deleted coefficients in the new model, but not LQA because the two 
algorithms use different deletion criteria. 

5. Discussion. We have shown how a particular class of MM algorithms 
may be applied to variable selection. In modifying previous work on vari- 
able selection using penalized least squares and penalized likelihood by Fan 
and Li [7, 8, 9] and Cai, Fan, Li and Zhou [3], we have shown how a slight 
perturbation of the penalty function can eliminate the possibility of mis- 
takenly excluding variables too soon while simultaneously enabling certain 
convergence results. While the numerical tests given here deal with four 
very diverse models, the range of possible applications of this method is 
even broader. Generally speaking, the MM algorithms of this article may be 
applied to any situation where an objective function — whether a likelihood 
or not — is penalized using a penalty function such as the one in (2.1). If the 
goal is to maximize the penalized objective function and p\{-) satisfies the 
conditions of Proposition 3.1, then an MM algorithm may be applicable. If 
the original (unpenalized) objective function is concave, then the modified 
Newton-Raphson approach of Section 3.4 holds promise. In Section 2 we list 
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several distinct classes of penalty functions in the literature that satisfy the 
conditions of Proposition 3.1, but there may also be other useful penalties 
to which our method applies. 

The numerical tests of Section 4 indicate that the modified SCAD penalty 
we propose performs well on simulated data sets. This algorithm has compa- 
rable relative model error (sometimes quite a bit better, sometimes slightly 
worse) to BIC and the unmodified SCAD penalty implemented using LQA, 
and it consistently outperforms AIC. Our proposed algorithm tends to result 
in more parsimonious models than the LQA algorithm, typically identifying 
more actual zeros correctly but also eliminating too many nonzero coefh- 
cients. This fact is surprising in light of the drawback that LQA can exclude 
variables too soon during the iterative process, a drawback that our algo- 
rithm corrects. The particular choice of e, addressed in Section 3.3, may 
warrant further study because of its influence on the complexity of the final 
model chosen. 

An important difference between our algorithm and both AIC and BIC is 
the fact that the latter two methods are often not computationally efficient, 
with computing time scaling exponentially in the number of candidate vari- 
ables whenever it becomes necessary to search exhaustively over the whole 
model space. This means that in problems with hundreds or thousands of 
candidate variables, AIC and BIC can be difficult if not impossible to imple- 
ment. Such problems are becoming more and more prevalent in the statisti- 
cal literature as topics such as microarray data and data mining increase in 
popularity. 

Finally, we have seen in one example involving the Cox proportional haz- 
ards model that both our method and LQA perform well when the model is 
misspecified, even outperforming the oracle method for samples of size 40. 
Although questions of model misspecification are largely outside the scope 
of this article, it is useful to remember that although simulation studies can 
aid our understanding, the model assumed for any real data set is only an 
approximation of reality. 



Proof of Proposition 3.1. The proof uses the following lemma. 

Lemma A.l. Under the assumptions of Proposition 3.1, p'-^{0+) / {e + 9) 
is a nonincreasing function of 9 for any nonnegative e. 

The proof of the lemma is immediate: Both p'x{9) and {e + 6)~^ are positive 
and nonincreasing on (0, oo) , so their product is nonincreasing. For any 9 > 0, 
we see that 



APPENDIX 



(A.l) 




VARIABLE SELECTION VIA MM 



25 



Furthermore, since px{-) is nondecreasing and concave on (0,co) (and con- 
tinuous at 0), it is also continuous on [0, cxd). Thus, <&0o(^) ~Pa(|^|) is an 
even function, piecewise differentiable and continuous everywhere. Taking 
e = in Lemma A.l, (A.l) imphes that ^0^(6) — Pa(|^|) is nonincreasing 
for 9 £ (0, l^ol) and nondecreasing for 9 £ (l^oli oo); this function is therefore 
minimized on (0,oo) at \9q\. Since it is clear that <l>e,j(|0o|) =Pa(|^o|) and 
^do{~\(^o\) = Pxi~\0o\)j condition (3.3) is satisfied for 6*0 = ±|6'o|. □ 

Proof of Proposition 3.2. For part (a), it suffices to show that 
^6o,e{G) majorizes PA,e(|^|) at ^o- It follows by definition that ^gQ^^{9o) = 
Px,ei\^o\)- Furthermore, Lemma A.l shows as in Proposition 3.1 that the 
even function <I>5)(, ^(0) — PA,e(|^|) is decreasing on (0, |^o|) and increasing on 
(I^oIjCo), giving the desired result. 

To prove part (b), it is sufficient to show that |pA,e(|^|) — Pa(|^|)| — * 
uniformly on compact subsets of the parameter space as e | 0. Since p';\(^+) 
is nonincreasing on (0,oo), 



\px,em)-Pxm)\<eiog 



1 + — 

e 



Pa(0+), 



and because p^(O-l-) < oo, the right-hand side of the above inequality tends 
to uniformly on compact subsets of the parameter space as e | 0. □ 

Proof of Corollary 3.2. Let ^ denote a maximizer of Q{P) and put 
B = {/3 e : QeoiP) > Q0)} for some fixed eo > 0. Then B is compact and 
contains all 0^ for < e < Eq. Thus, Proposition 3.2 shows that for e < Eq, 

< \Qm-Qem + \Qe{K)-QCPe)\ 

If (3* is a limit point of {l^e^eio-, then by the continuity of Q{P), \Q{P*) — 
Q{f3) \ = and so /3* is a maximizer of Q{f3)- □ 



Proof of Proposition 3.3. Given an initial value /3'^°\ let /S^'^^ = 
M^{f3^^'^) for A; > 1; that is, {P^^^} is the sequence of points that the MM 
algorithm generates starting from P^^^ . Let A denote the set of limit points 
of this sequence. For any (3* G A, passing to a subsequence we have /3^'^"-' 
(3*. The quantity Qe{P^''"^), since it is increasing in n and bounded above, 
converges to a limit as n ^ oo. Thus, taking limits in the inequalities 
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gives Qe(/9*) = Qe{liTan-*c<y M{Pi^^)} , assuming this limit exists. Of com'se, 
if M(/3) is continuous at /3*, then we have Qe{l3*) = Qe{M{l3*)}, which 
imphes that (3* is a stationary point of Q^{f3). 

Note that A is not necessarily nonempty in the above proof. However, we 
know that each jd^'^^ lies in the set {f3:Qs{f3) > Q£{f3i)}, so if this set is 
compact, as is often the case, we may conclude that A is indeed nonempty. 
□ 
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